Creating and Exploiting Annotated Corpora

Author: Vidová Hladká Barbora
Publication date: 20/05/2021

Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech

Author: Hana Jiří
Jelínek Tomáš
Rosen Alexandr
Vidová Hladká Barbora
Škodová Svatava
Štindlová Barbora
Publication venue: 'Charles University in Prague, Karolinum Press'
Publication date: 01/01/2020

Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of describing non-native Czech in terms of standard linguistic categories. The book discusses in detail the practical aspects of corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats and search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter listing lessons learned and pitfalls to avoid.

CU Digital Repository

Directory of Open Access Books (DOAB)

Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech

Author: Hana Jiří
Jelínek Tomáš
Rosen Alexandr
Vidová Hladká Barbora
Škodová Svatava
Štindlová Barbora
Publication venue: Praha
Publication date: 01/10/2020

Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of describing non-native Czech in terms of standard linguistic categories. The book discusses in detail the practical aspects of corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats and search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter listing lessons learned and pitfalls to avoid.

CU Digital Repository

Tour de CLARIN Volume One

Tour de CLARIN is an initiative started by CLARIN ERIC in 2016 that periodically highlights prominent user involvement activities of CLARIN national consortia in the form of blog posts published on the CLARIN webpage and disseminated through the CLARIN news flash and on social media. By focusing on a different national consortium every two months and showcasing their outstanding language resources, text processing tools, user involvement events and researchers, we have been aiming to increase the visibility of the various consortia, reveal the richness of the CLARIN landscape, and display the full range of activities throughout the network that can not only inform and inspire other consortia, but also show what CLARIN has to offer to researchers, teachers, students, professionals and the general public interested in using and processing language data in various forms. In the two years we have been running the initiative, and having visited nearly half of all the CLARIN member countries, we can say that Tour de CLARIN has proved to be one of the flagship user involvement initiatives by CLARIN ERIC: highly valuable for our network and incredibly popular with our readers. This is why we have decided to collect the blog posts in a printed volume. The first volume presents all nine countries that we have visited so far: Finland, Sweden, Austria, the Netherlands, Poland, Belgium, the Czech Republic, Greece and Lithuania.

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Prague Dependency Treebank 3.5

Author: Bejček Eduard
Buráňová Eva
Bémová Alevtina
Hajič Jan
Hajičová Eva
Havelka Jiří
Homola Petr
Kettnerová Václava
Klyueva Natalia
Kolářová Veronika
Kučová Lucie
Kárník Jiří
Lopatková Markéta
Mikulová Marie
Mírovský Jiří
Nedoluzhko Anna
Pajas Petr
Panevová Jarmila
Poláková Lucie
Rysová Magdaléna
Sgall Petr
Spoustová Johanka
Straňák Pavel
Synková Pavlína
Urešová Zdeňka
Vidová Hladká Barbora
Zeman Daniel
Zikánová Šárka
Ševčíková Magda
Štěpánek Jan
Žabokrtský Zdeněk
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 19/02/2018

The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts, i.e., all annotation from PDT 1.0, PDT 2.0, PDT 2.5, PDT 3.0, PDiT 1.0 and PDiT 2.0, plus corrections, a new structure of the basic documentation and a new list of authors covering all previous editions. The Prague Dependency Treebank 3.5 (PDT 3.5) contains the same texts as the previous versions since 2.0; there are 49,431 sentences (832,823 words) annotated on all layers, from tectogrammatical annotation to syntax to morphology. There are additional sentences annotated for syntax and morphology only; the totals for the lower layers of annotation are 87,913 sentences with 1,502,976 words at the analytical layer (surface dependency syntax) and 115,844 sentences with 1,956,693 words at the morphological layer (these totals include the sentences that are also annotated at the higher layers). Closely linked to the tectogrammatical layer is the annotation of sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

Author: Bejček Eduard
Buráňová Eva
Bémová Alevtina
Fučíková Eva
Hajič Jan
Hajičová Eva
Havelka Jiří
Hlaváčová Jaroslava
Homola Petr
Ircing Pavel
Kettnerová Václava
Klyueva Natalia
Kolářová Veronika
Kučová Lucie
Kárník Jiří
Lopatková Markéta
Mareček David
Mikulová Marie
Mírovský Jiří
Nedoluzhko Anna
Novák Michal
Pajas Petr
Panevová Jarmila
Peterek Nino
Poláková Lucie
Popel Martin
Popelka Jan
Romportl Jan
Rysová Magdaléna
Semecký Jiří
Sgall Petr
Spoustová Johanka
Straka Milan
Straňák Pavel
Synková Pavlína
Toman Josef
Urešová Zdeňka
Vidová Hladká Barbora
Zeman Daniel
Zikánová Šárka
Ševčíková Magda
Šindlerová Jana
Štěpánek Jan
Štěpánková Barbora
Žabokrtský Zdeněk
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 01/01/2020

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. The PDT corpora included in PDT-C are: the Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English); the Prague Dependency Treebank of Spoken Czech (spoken data, including audio, transcripts and multiple speech reconstruction annotation); and PDT-Faust (user-generated texts). The differences from the separately published original treebanks can be briefly described as follows: it is published in one package, to allow easier data handling for all the datasets; the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed; and a common valency lexicon for all four original parts is enclosed. The documentation provides two browsing and editing desktop tools (TrEd and MEd), and the corpus is also available online for searching using PML-TQ.

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
